Let's start by looking at the data summary.
## 'data.frame': 1599 obs. of 13 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ fixed.acidity : num 7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
## $ volatile.acidity : num 0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
## $ citric.acid : num 0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
## $ residual.sugar : num 1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
## $ chlorides : num 0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
## $ free.sulfur.dioxide : num 11 25 15 17 11 13 15 15 9 17 ...
## $ total.sulfur.dioxide: num 34 67 54 60 34 40 59 21 18 102 ...
## $ density : num 0.998 0.997 0.997 0.998 0.998 ...
## $ pH : num 3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
## $ sulphates : num 0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
## $ alcohol : num 9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
## $ quality : int 5 5 5 6 5 5 5 7 7 5 ...
## X fixed.acidity volatile.acidity citric.acid
## Min. : 1.0 Min. : 4.60 Min. :0.1200 Min. :0.000
## 1st Qu.: 400.5 1st Qu.: 7.10 1st Qu.:0.3900 1st Qu.:0.090
## Median : 800.0 Median : 7.90 Median :0.5200 Median :0.260
## Mean : 800.0 Mean : 8.32 Mean :0.5278 Mean :0.271
## 3rd Qu.:1199.5 3rd Qu.: 9.20 3rd Qu.:0.6400 3rd Qu.:0.420
## Max. :1599.0 Max. :15.90 Max. :1.5800 Max. :1.000
## residual.sugar chlorides free.sulfur.dioxide
## Min. : 0.900 Min. :0.01200 Min. : 1.00
## 1st Qu.: 1.900 1st Qu.:0.07000 1st Qu.: 7.00
## Median : 2.200 Median :0.07900 Median :14.00
## Mean : 2.539 Mean :0.08747 Mean :15.87
## 3rd Qu.: 2.600 3rd Qu.:0.09000 3rd Qu.:21.00
## Max. :15.500 Max. :0.61100 Max. :72.00
## total.sulfur.dioxide density pH sulphates
## Min. : 6.00 Min. :0.9901 Min. :2.740 Min. :0.3300
## 1st Qu.: 22.00 1st Qu.:0.9956 1st Qu.:3.210 1st Qu.:0.5500
## Median : 38.00 Median :0.9968 Median :3.310 Median :0.6200
## Mean : 46.47 Mean :0.9967 Mean :3.311 Mean :0.6581
## 3rd Qu.: 62.00 3rd Qu.:0.9978 3rd Qu.:3.400 3rd Qu.:0.7300
## Max. :289.00 Max. :1.0037 Max. :4.010 Max. :2.0000
## alcohol quality
## Min. : 8.40 Min. :3.000
## 1st Qu.: 9.50 1st Qu.:5.000
## Median :10.20 Median :6.000
## Mean :10.42 Mean :5.636
## 3rd Qu.:11.10 3rd Qu.:6.000
## Max. :14.90 Max. :8.000
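The two blocks of output above have the shape of R's `str()` and `summary()`. As a minimal sketch, the same calls can be tried on a small data frame built from the first ten observations printed above. The name `wine_head` is made up for this illustration, and only four of the columns are included.

```r
# First ten observations of four columns, copied from the str() output above.
# `wine_head` is a hypothetical name used only for this sketch.
wine_head <- data.frame(
  fixed.acidity  = c(7.4, 7.8, 7.8, 11.2, 7.4, 7.4, 7.9, 7.3, 7.8, 7.5),
  residual.sugar = c(1.9, 2.6, 2.3, 1.9, 1.9, 1.8, 1.6, 1.2, 2, 6.1),
  alcohol        = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5),
  quality        = c(5L, 5L, 5L, 6L, 5L, 5L, 5L, 7L, 7L, 5L)
)

str(wine_head)      # column types and the first few values
summary(wine_head)  # min, quartiles, median, mean and max per column
```

With only ten rows the statistics will differ from the full-data summary, but the layout of the output is the same.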
From this summary we can group the variables into some broad categories: acidity, sugar, chemical groups, quality and alcohol content.
Let's start by plotting the quality. The distribution looks roughly normal, centered on ratings of 5 and 6.
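The shape of the quality distribution can also be checked numerically. The sketch below is not the code used for the plot: it rebuilds the counts per quality score from the `n` column of the per-quality alcohol summary printed later in this report, and the variable names are chosen for illustration only.

```r
# Number of wines per quality score, taken from the `n` column of the
# per-quality alcohol summary later in this report.
quality_counts <- c("3" = 10, "4" = 53, "5" = 681, "6" = 638, "7" = 199, "8" = 18)

# Share of wines at each score, and the share rated 5 or 6.
quality_share <- quality_counts / sum(quality_counts)
share_5_or_6 <- sum(quality_share[c("5", "6")])

# Mean quality recovered from the counts.
mean_quality <- sum(as.integer(names(quality_counts)) * quality_counts) /
  sum(quality_counts)

print(round(quality_share, 3))

# Draw the bar chart only in an interactive session.
if (interactive()) barplot(quality_counts, xlab = "quality", ylab = "count")
```

The counts add up to the 1599 observations in the dataset, and the recovered mean agrees with the mean quality reported in the summary.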
To continue the analysis, let's look at density, alcohol level and sugar.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.40 9.50 10.20 10.42 11.10 14.90
The density looks like a normal distribution and the alcohol data is a little right-skewed (the mean of 10.42 sits above the median of 10.20). We can see a large spike in the alcohol level around 9.5%.
Sugar is drastically skewed, so it makes sense to look at it on a log scale.
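To illustrate why a log scale helps with a long right tail, here is a small sketch using only the first ten `residual.sugar` values from the `str()` output. The `skewness()` helper is a hypothetical, moment-based definition written for this example, not a function from the original analysis.

```r
# Moment-based sample skewness (one common definition; others exist).
skewness <- function(x) {
  d <- x - mean(x)
  mean(d^3) / mean(d^2)^1.5
}

# First ten residual.sugar values from the str() output above.
sugar <- c(1.9, 2.6, 2.3, 1.9, 1.9, 1.8, 1.6, 1.2, 2, 6.1)

skew_raw <- skewness(sugar)
skew_log <- skewness(log10(sugar))
print(c(raw = skew_raw, log10 = skew_log))

# Draw the histogram only in an interactive session.
if (interactive()) hist(log10(sugar), main = "log10(residual.sugar)")
```

On this small sample the log transform pulls the single large value closer to the rest, so the skewness drops without disappearing entirely.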
Nothing significant can be seen here.
Now, let's look at the acidity.
pH seems to follow a normal distribution, with the largest concentration around 3.3.
Fixed and volatile acidity both look skewed, but no pattern is visible in the citric acid levels, so let's explore citric acid further.
It seems skewed when measured on a log scale.
Finally, let's explore the chemical levels.
These plots look like normal distributions if we remove the outliers.
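One common way to decide what counts as an outlier is Tukey's rule, which flags values above Q3 + 1.5 × IQR. The sketch below applies it to the quartiles printed in the summary. Both the choice of rule and the choice of `chlorides` and `sulphates` as examples are assumptions made for illustration; the report does not say how the outliers were identified.

```r
# Quartiles and maxima copied from the summary() output above.
quartiles <- data.frame(
  variable = c("chlorides", "sulphates"),
  q1  = c(0.07, 0.55),
  q3  = c(0.09, 0.73),
  max = c(0.611, 2.0)
)

# Tukey's rule: values above Q3 + 1.5 * IQR are flagged as outliers.
quartiles$upper_fence <- quartiles$q3 + 1.5 * (quartiles$q3 - quartiles$q1)
quartiles$max_is_outlier <- quartiles$max > quartiles$upper_fence
print(quartiles)
```

For both variables the maximum lies far above the upper fence, which is consistent with a long right tail that disappears once those points are dropped.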
Both distributions are skewed.

# Univariate Analysis
There are 1599 different wine bottles and the dataset has 13 variables: the row index X plus 12 features (“fixed.acidity”,“volatile.acidity”,“citric.acid”,“residual.sugar”,“chlorides”,“free.sulfur.dioxide”,“total.sulfur.dioxide”,“density”,“pH”,“sulphates”,“alcohol”,“quality”).
Some interesting observations:

* The majority of the wines are rated a quality of 5 or 6.
* The alcohol levels are skewed, with a large spike at 9.5%.
* The median pH value is 3.31.
The main feature in this dataset is the quality.
The main features of interest are citric.acid, residual.sugar, pH and alcohol. It would be interesting to see how these variables affect the quality.
No.
Citric acid and alcohol seem to be a little unusual. Alcohol has a skewed distribution with a sudden dip, so it looks almost bimodal, while citric acid is skewed when the x axis is on a log scale.
No additional changes were made.
## fixed.acidity volatile.acidity citric.acid residual.sugar
## fixed.acidity 1.00000000 -0.256130895 0.6717034 0.114776724
## volatile.acidity -0.25613089 1.000000000 -0.5524957 0.001917882
## citric.acid 0.67170343 -0.552495685 1.0000000 0.143577162
## residual.sugar 0.11477672 0.001917882 0.1435772 1.000000000
## density 0.66804729 0.022026232 0.3649472 0.355283371
## pH -0.68297819 0.234937294 -0.5419041 -0.085652422
## alcohol -0.06166827 -0.202288027 0.1099032 0.042075437
## quality 0.12405165 -0.390557780 0.2263725 0.013731637
## density pH alcohol quality
## fixed.acidity 0.66804729 -0.68297819 -0.06166827 0.12405165
## volatile.acidity 0.02202623 0.23493729 -0.20228803 -0.39055778
## citric.acid 0.36494718 -0.54190414 0.10990325 0.22637251
## residual.sugar 0.35528337 -0.08565242 0.04207544 0.01373164
## density 1.00000000 -0.34169933 -0.49617977 -0.17491923
## pH -0.34169933 1.00000000 0.20563251 -0.05773139
## alcohol -0.49617977 0.20563251 1.00000000 0.47616632
## quality -0.17491923 -0.05773139 0.47616632 1.00000000
Let's draw a correlation plot to get a better understanding.
From the above table and plot matrix we see that “fixed.acidity” (r ≈ 0.67), “volatile.acidity” (r ≈ -0.55) and “pH” (r ≈ -0.54) have some correlation with “citric.acid”. Interestingly, density has some correlation with “fixed.acidity” (r ≈ 0.67) and “alcohol” (r ≈ -0.50). Also, “quality” has some correlation with “alcohol” (r ≈ 0.48).
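The pairs singled out above can be pulled from the correlation table programmatically. The following sketch re-enters the printed matrix rounded to three decimals and lists every pair with |r| above 0.45. The threshold and the variable names `r` and `strong` are choices made for this illustration.

```r
vars <- c("fixed.acidity", "volatile.acidity", "citric.acid", "residual.sugar",
          "density", "pH", "alcohol", "quality")

# Correlation table from above, rounded to three decimals.
# On the full data this would come from something like cor(analysis_winedata[, vars]).
r <- matrix(c(
   1.000, -0.256,  0.672,  0.115,  0.668, -0.683, -0.062,  0.124,
  -0.256,  1.000, -0.552,  0.002,  0.022,  0.235, -0.202, -0.391,
   0.672, -0.552,  1.000,  0.144,  0.365, -0.542,  0.110,  0.226,
   0.115,  0.002,  0.144,  1.000,  0.355, -0.086,  0.042,  0.014,
   0.668,  0.022,  0.365,  0.355,  1.000, -0.342, -0.496, -0.175,
  -0.683,  0.235, -0.542, -0.086, -0.342,  1.000,  0.206, -0.058,
  -0.062, -0.202,  0.110,  0.042, -0.496,  0.206,  1.000,  0.476,
   0.124, -0.391,  0.226,  0.014, -0.175, -0.058,  0.476,  1.000
), nrow = 8, byrow = TRUE, dimnames = list(vars, vars))

# Look only at the upper triangle so each pair is reported once.
idx <- which(upper.tri(r) & abs(r) > 0.45, arr.ind = TRUE)
strong <- data.frame(
  var1 = vars[idx[, "row"]],
  var2 = vars[idx[, "col"]],
  r    = r[idx]
)

# Strongest relationships first, regardless of sign.
strong <- strong[order(-abs(strong$r)), ]
print(strong, row.names = FALSE)
```

Sorting by absolute value puts negative and positive relationships on the same footing, which is why fixed.acidity and pH come out on top.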
Let's now look at pH, fixed.acidity and volatile.acidity versus citric.acid.
##
## Call:
## lm(formula = citric.acid ~ pH, data = analysis_winedata)
##
## Coefficients:
## (Intercept) pH
## 2.5350 -0.6838
From the scatter plot we can see that the data is moderately negatively correlated (r ≈ -0.54 in the correlation table above).
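The slope reported by `lm()` and the correlation coefficient are closely related: for a fit with a single predictor, the slope equals r · sd(y) / sd(x). The sketch below checks this on the first ten pH and citric.acid values from the `str()` output, so its numbers will not match the full-data fit above. The variable names are illustrative.

```r
# First ten pH and citric.acid values from the str() output above.
ph     <- c(3.51, 3.2, 3.26, 3.16, 3.51, 3.51, 3.3, 3.39, 3.36, 3.35)
citric <- c(0, 0, 0.04, 0.56, 0, 0, 0.06, 0, 0.02, 0.36)

fit <- lm(citric ~ ph)          # same model form as citric.acid ~ pH
slope <- coef(fit)[["ph"]]

# For a one-predictor least-squares fit, slope = r * sd(y) / sd(x).
r <- cor(ph, citric)
slope_from_r <- r * sd(citric) / sd(ph)

print(c(slope = slope, slope_from_r = slope_from_r, r = r))

# Draw the scatter plot with the fitted line only in an interactive session.
if (interactive()) {
  plot(ph, citric)
  abline(fit)
}
```

This is why a steep-looking slope does not by itself mean a strong correlation: the slope also depends on the spread of the two variables.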
##
## Call:
## lm(formula = citric.acid ~ fixed.acidity, data = analysis_winedata)
##
## Coefficients:
## (Intercept) fixed.acidity
## -0.35427 0.07515
From the scatter plot we can see that the data is positively correlated (r ≈ 0.67 in the correlation table above).
##
## Call:
## lm(formula = citric.acid ~ volatile.acidity, data = analysis_winedata)
##
## Coefficients:
## (Intercept) volatile.acidity
## 0.5882 -0.6011
This data looks very similar to pH vs citric acid levels. Maybe pH and volatile.acidity have some relationship. Let’s try to plot it.
##
## Call:
## lm(formula = pH ~ volatile.acidity, data = analysis_winedata)
##
## Coefficients:
## (Intercept) volatile.acidity
## 3.2042 0.2026
There does seem to be a weak positive correlation here (r ≈ 0.23).
Now, let's look at density vs alcohol and density vs fixed.acidity.
##
## Call:
## lm(formula = density ~ alcohol, data = analysis_winedata)
##
## Coefficients:
## (Intercept) alcohol
## 1.0059059 -0.0008788
The general trend is that density decreases as the alcohol level increases. This makes sense: alcohol is lighter than water, so more alcohol means less water and hence a lower density.
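To get a feel for the size of this effect, the fitted line can be evaluated at a few alcohol levels. This is a sketch that simply plugs the coefficients printed above into the line equation; `predict_density` is a hypothetical helper, not part of the original analysis.

```r
# Coefficients copied from the lm(density ~ alcohol) output above.
intercept <- 1.0059059
slope     <- -0.0008788

# Density predicted by the fitted line for a given alcohol level.
predict_density <- function(alcohol) intercept + slope * alcohol

# The spike at 9.5, the mean alcohol level (10.42) and a stronger wine.
alcohol_levels <- c(9.5, 10.42, 12)
predicted <- predict_density(alcohol_levels)
print(data.frame(alcohol = alcohol_levels, density = round(predicted, 4)))
```

At the mean alcohol level the line returns a value close to the mean density from the summary, as expected for a least-squares fit.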
There is a clear linear relationship between fixed acidity and density (r ≈ 0.67): the acidity goes up with the density.
Now, let's move to the most interesting plot, between alcohol and quality.
## $`3`
## vars n mean sd median trimmed mad min max range skew kurtosis se
## X1 1 10 9.96 0.82 9.93 10.02 0.78 8.4 11 2.6 -0.41 -0.99 0.26
##
## $`4`
## vars n mean sd median trimmed mad min max range skew kurtosis
## X1 1 53 10.27 0.93 10 10.21 1.19 9 13.1 4.1 0.61 -0.23
## se
## X1 0.13
##
## $`5`
## vars n mean sd median trimmed mad min max range skew kurtosis
## X1 1 681 9.9 0.74 9.7 9.79 0.44 8.5 14.9 6.4 1.83 5.25
## se
## X1 0.03
##
## $`6`
## vars n mean sd median trimmed mad min max range skew kurtosis
## X1 1 638 10.63 1.05 10.5 10.56 1.19 8.4 14 5.6 0.54 -0.16
## se
## X1 0.04
##
## $`7`
## vars n mean sd median trimmed mad min max range skew kurtosis
## X1 1 199 11.47 0.96 11.5 11.47 1.04 9.2 14 4.8 0.01 -0.47
## se
## X1 0.07
##
## $`8`
## vars n mean sd median trimmed mad min max range skew kurtosis se
## X1 1 18 12.09 1.22 12.15 12.12 1.19 9.8 14 4.2 -0.2 -0.98 0.29
##
## attr(,"call")
## by.default(data = x, INDICES = group, FUN = describe, type = type)
There seems to be a positive correlation, except in the case of wines rated 5 in quality.
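The dip at quality 5 is easy to see when the group means are lined up. The sketch below copies `n` and `mean` from the per-quality output above and computes the change from one quality score to the next; the names `by_quality` and `step` are illustrative.

```r
# Group sizes and mean alcohol per quality score, copied from the output above.
by_quality <- data.frame(
  quality      = 3:8,
  n            = c(10, 53, 681, 638, 199, 18),
  mean_alcohol = c(9.96, 10.27, 9.90, 10.63, 11.47, 12.09)
)

# Change in mean alcohol from each quality score to the next.
step <- diff(by_quality$mean_alcohol)
names(step) <- paste(head(by_quality$quality, -1), "->",
                     tail(by_quality$quality, -1))
print(round(step, 2))

# Weighted by group size, the group means recover the overall mean alcohol.
overall_mean <- weighted.mean(by_quality$mean_alcohol, by_quality$n)
print(round(overall_mean, 2))
```

Only the step from 4 to 5 is negative. Note that the groups rated 3, 4 and 8 are small (10, 53 and 18 wines), so their means are less reliable than those of the 5 and 6 groups.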
Most of the comparisons made with citric acid showed some type of linear relationship.
The comparison between alcohol and density supported the hypothesis that wines with low alcohol levels contain a higher proportion of water and are therefore higher in density, as water is denser than alcohol.
Finally, quality and alcohol showed an increasing linear relationship, but there is a sudden dip for wines with quality ‘5’.
As mentioned above, the dip in quality vs alcohol is very interesting.
pH and fixed acidity seem to have the strongest correlation (r ≈ -0.68).
In the above plot of alcohol vs density vs quality, we can see that wines rated 5 in quality tend to be denser while having a low alcohol content.
No significant observations can be derived from this plot.
There are no interesting patterns here.
Clearly, acidity varies negatively with pH, but the quality seems to be uniform.
From the first graph, density seems to have an inverse relationship with quality: the denser the wine, the lower its score. The relationship is weak, though (r ≈ -0.17 in the correlation table).
No.
There seem to be spikes in the citric acid distribution instead of the expected normal distribution.
This boxplot shows how quality varies with alcohol level with a dip at ‘5’.
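The takeaways from this analysis are that wines with high quality tend to have higher alcohol content. Residual sugar is low for most wines in the dataset, although its correlation with quality is close to zero (r ≈ 0.01). Another interesting finding was that citric acid content decreases as pH rises. Since a lower pH means higher acidity, the more acidic wines have the higher citric acid content.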
In conclusion, if you are looking for a good bottle of wine, it will most likely have very little sweetness to it but will be strong.